You are a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. You have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights you discover will then help guide marketing strategy for the company. You will present your analysis to the Bellabeat executive team along with your high-level recommendations for Bellabeat’s marketing strategy.
The objective of this analysis is to learn more about consumer behaviour from the data (with a focus on a female audience) to help guide the marketing team.
This section is supposed to contain the questions and answers from a project kickoff meeting. Since this is a case study, there is no Q&A session, so we will skip this step. If it did exist, the goal of this section would be to pin down the objective of the analysis project and to learn more about the project itself.
What to do in this section:
import os
import pandas as pd
# ignore any warning
import warnings
warnings.filterwarnings("ignore")
raw_data_path = "../raw_data/"
file_names = os.listdir(raw_data_path)
file_path = [f'{raw_data_path}{filename}' for filename in file_names]
print(file_names)
# read all data (the numeric indices below assume the order os.listdir returned on my machine)
# Intensities
df_minuteIntensitiesWide = pd.read_csv(file_path[0])
df_minuteIntensitiesNarrow = pd.read_csv(file_path[4])
df_hourlyIntensities = pd.read_csv(file_path[7])
df_dailyIntensities = pd.read_csv(file_path[17])
# Calories
df_minuteCaloriesWide = pd.read_csv(file_path[13])
df_minuteCaloriesNarrow = pd.read_csv(file_path[11])
df_hourlyCalories = pd.read_csv(file_path[9])
df_dailyCalories = pd.read_csv(file_path[8])
# Step
df_minuteStepsWide = pd.read_csv(file_path[15])
df_minuteStepsNarrow = pd.read_csv(file_path[14])
df_hourlySteps = pd.read_csv(file_path[6])
df_dailySteps = pd.read_csv(file_path[12])
# Sleep
df_sleepDay = pd.read_csv(file_path[2])
df_minuteSleep = pd.read_csv(file_path[10])
# HeartRate
df_heartrate_seconds = pd.read_csv(file_path[3])
# Standalone
df_minuteMETsNarrow = pd.read_csv(file_path[5])
df_weightLogInfo = pd.read_csv(file_path[16])
df_dailyActivity = pd.read_csv(file_path[1])
df_dict = {
'df_minuteIntensitiesWide' : df_minuteIntensitiesWide,
'df_minuteIntensitiesNarrow' : df_minuteIntensitiesNarrow,
'df_hourlyIntensities' : df_hourlyIntensities,
'df_dailyIntensities' : df_dailyIntensities,
'df_minuteCaloriesWide' : df_minuteCaloriesWide,
'df_minuteCaloriesNarrow' : df_minuteCaloriesNarrow,
'df_hourlyCalories' : df_hourlyCalories,
'df_dailyCalories' : df_dailyCalories,
'df_minuteStepsWide' : df_minuteStepsWide,
'df_minuteStepsNarrow' : df_minuteStepsNarrow,
'df_hourlySteps' : df_hourlySteps,
'df_dailySteps' : df_dailySteps,
'df_sleepDay' : df_sleepDay,
'df_minuteSleep' : df_minuteSleep,
'df_heartrate_seconds' : df_heartrate_seconds,
'df_minuteMETsNarrow' : df_minuteMETsNarrow,
'df_weightLogInfo' : df_weightLogInfo,
'df_dailyActivity' : df_dailyActivity
}
# this will help me see all the variables that hold the table data
def getTableNames():
    for df_name in df_dict:
        print(df_name)
def CheckDataFormat(df, df_name, is_wide=False):
    # wide-format tables have a lot of columns and fill most of the screen,
    # so we inspect them separately from the narrow ones
    if ('Wide' in df_name) == is_wide:
        print(f'Data Format for {df_name}')
        display(df.head(2))
        df.info()  # info() prints directly; no need to wrap it in display()

for df_name, df in df_dict.items():
    CheckDataFormat(df, df_name, is_wide=False)

for df_name, df in df_dict.items():
    CheckDataFormat(df, df_name, is_wide=True)
As it turns out, most of the tables have no nulls, except df_weightLogInfo, where the Fat feature is null for almost every data point. Another thing to point out is that every table has an Id attribute, which I believe is a foreign key to a user table. Sadly there is no user table in this dataset, but we can still use Id to join most of the tables together. All of the tables also contain datetime data stored as strings, so we need to convert it to datetime so we can manipulate it, which will also make visualization easier later.
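The two observations above (nulls concentrated in the Fat feature, Id usable as a join key) can be sketched on toy frames; the values are synthetic, only the column names mirror the dataset:

```python
import pandas as pd

# Toy stand-ins for two of the tables (synthetic values; only the Id
# scheme and column names mirror the real dataset)
daily = pd.DataFrame({'Id': [1, 1, 2], 'Calories': [1800, 2100, 1500]})
weight = pd.DataFrame({'Id': [1, 2],
                       'WeightKg': [60.0, 72.5],
                       'Fat': [None, None]})

# per-column null counts, the same information df.info() summarizes
null_counts = weight.isna().sum()

# Id works as a join key even without a separate user table
joined = daily.merge(weight, on='Id', how='left')
```

Every daily row picks up the user's weight columns through the Id key, which is exactly how the sleep and activity tables are joined later on.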
Now we're going to convert the datetime strings into a proper datetime format. To do that we need to know which columns hold datetime data. We could use string matching, but after looking at the dataset, each table has exactly one column with the object dtype, and it is the datetime column! That saves a bit of time compared to string manipulation.
def StringtoDate(df):
    object_columns = df.select_dtypes(include=['object']).columns
    for object_column in object_columns:
        df[object_column] = pd.to_datetime(df[object_column])
        # use "_" because some tables already have a column called Date
        df['Date_'] = df[object_column].apply(lambda dt: dt.date())
        # Time_ is a datetime with a placeholder date (ignore the date part);
        # I'm not using .time() because I couldn't find a way to visualize it
        df['Time_'] = df[object_column].apply(lambda dt: dt.replace(year=2000, month=1, day=1))

for df_name, df in df_dict.items():
    StringtoDate(df)
    print(f'{df_name} StringtoDate Done!')
I also could have changed Id to a string instead of a number, but it doesn't really affect anything, because we're not going to manipulate it.
Because this is a case study, I don't have the luxury of asking anyone questions. But if I were able to, I would like to ask:
What to do in this section:
As we already checked, the data is clean of nulls. So let's take a look at the datetime data, because there might be something off or unusual about it.
# import library that'll be used
import plotly.express as px
import seaborn as sns
import math
import numpy as np
def plot_minute_distribution(df, df_name):
    # I'm not going to analyze the wide data format, because it holds the
    # same values as the narrow one, and I'm more comfortable with narrow data
    if ('minute' in df_name) & ('Wide' not in df_name):
        temp = df.copy()
        temp['count'] = 1
        temp = temp[['Time_', 'count']].groupby('Time_').sum()
        temp10min = temp.resample('10min').mean()
        fig = px.bar(
            temp10min,
            x=temp10min.index,
            y='count'
        )
        fig.update_layout(
            title=f'Minute data distribution for {df_name}',
            xaxis_title='Time (in a day)',
            yaxis_title='Count'
        )
        fig.show()

for df_name, df in df_dict.items():
    plot_minute_distribution(df, df_name)
It turns out that every minute-level table is recorded continuously, except sleep: the sleep table only has rows when a sleeping activity is detected.
def plot_daily_distribution(df, df_name):
    # skip the minute-level tables, they were already plotted above
    if 'minute' not in df_name:
        temp = df.copy()
        temp['count'] = 1
        temp = temp[['Date_', 'count']].groupby('Date_').sum()
        fig = px.bar(
            temp,
            x=temp.index,
            y='count'
        )
        fig.update_layout(
            title=f'Daily data distribution for {df_name}',
            xaxis_title='Date',
            yaxis_title='Count'
        )
        fig.show()

for df_name, df in df_dict.items():
    plot_daily_distribution(df, df_name)
From the plots above we learn that every table starts and ends within the same month. But the weight table only has a handful of data points. Let's check how many Ids this table covers and what the weight data looks like!
fig = px.line(df_weightLogInfo,
x='Date_',
y="WeightKg",
title='Weight Data plot',
color='Id',
)
fig.add_scatter(x=df_weightLogInfo['Date_'],
                y=df_weightLogInfo['WeightKg'],
                mode='markers',
                marker_size=5,
                marker_color='red'  # fillcolor has no effect on markers
                )
fig.show()
print(f'number of unique ID : {df_weightLogInfo["Id"].nunique()}')
print(f'df_weightLogInfo data point counts : {len(df_weightLogInfo)}')
print('Weight ID value counts:')
display(df_weightLogInfo['Id'].value_counts())
There are only 8 Ids (out of 33) that have weight data, across 67 data points, and 54 of those belong to just 2 Ids. On top of that, the weight values show no significant changes, so we are unlikely to learn anything from weight trends. So I will not use the weight data: it is poor data and could lead to biased insights. Since our goal is better marketing and weight is not an option, there are two other features correlated with weight, Calories and METs, which we will cover later in the Analyze step.
Next, I will manually create clusters for the main features of the selected daily table: calories, distance, and steps. I'm not including intensities because it is already clustered, and I'm not creating a heart rate cluster either, because from the articles I read, every person has their own resting and active heart rate.
For Calories, I will divide it into 9 clusters, starting with less than 1k calories burned and ending with more than 4.5k. In between I split evenly into 500-calorie bins, so we have 1k-1.5k, 1.5k-2k, and so on.
calories_clust = ['< 1000'] + [f'{(n-1)*500} - {n*500}' for n in range(2, 10)] + ['> 4500']

def getCaloriesCluster(x):
    if x > 4500:
        return '> 4500'
    if x < 1000:
        return '< 1000'
    return calories_clust[math.floor(x / 500)]

for df_name, df in df_dict.items():
    # apply only to tables that have Calories as a feature and hold daily data
    if ('Calories' in df.columns) & ('minute' not in df_name) & ('hour' not in df_name):
        df['CaloriesCluster'] = df['Calories'].apply(getCaloriesCluster)
For Distance and Steps, I'm going to cluster them based on quantile values, so each cluster holds roughly the same amount of data.
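The hard-coded boundaries below (3, 6, 8 miles) were read off the data's quantiles; on synthetic distances, deriving such boundaries with pandas would look like this (a sketch, not the notebook's actual derivation):

```python
import numpy as np
import pandas as pd

# synthetic distance values standing in for TotalDistance
rng = np.random.default_rng(0)
distances = pd.Series(rng.gamma(shape=2.0, scale=2.5, size=500))

# quartile boundaries of the observed data
q = distances.quantile([0.25, 0.50, 0.75])

# pd.cut with quartile edges yields (roughly) equal-sized clusters
labels = ['Q1', 'Q2', 'Q3', 'Q4']
clusters = pd.cut(distances,
                  bins=[-np.inf, q[0.25], q[0.50], q[0.75], np.inf],
                  labels=labels)
counts = clusters.value_counts()
```

Rounding the computed edges to friendly numbers (3, 6, 8) then gives readable cluster labels at the cost of slightly uneven cluster sizes.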
def createDistanceCluster(x):
    if x <= 3:
        return "0 - 3 Miles"
    if x <= 6:
        return "3 - 6 Miles"
    if x <= 8:
        return "6 - 8 Miles"
    return "> 8 Miles"

distance_cluster_order = ["0 - 3 Miles", "3 - 6 Miles", "6 - 8 Miles", "> 8 Miles"]
df_dailyActivity['distanceCluster'] = df_dailyActivity['TotalDistance'].apply(createDistanceCluster)
def createStepCluster(x):
    if x <= 3800:
        return "0 - 3.8k steps"
    if x <= 7500:
        return "3.8 - 7.5k steps"
    if x <= 11000:
        return "7.5k - 11k steps"
    return "> 11k steps"

step_cluster_order = ["0 - 3.8k steps", "3.8 - 7.5k steps", "7.5k - 11k steps", "> 11k steps"]
df_dailyActivity['stepCluster'] = df_dailyActivity['TotalSteps'].apply(createStepCluster)
I will also create a new feature for the day of the week, and check whether it's a weekend or not, because people usually have a schedule for when they want to work out.
df_dailyActivity['ActivityDate'] = pd.to_datetime(df_dailyActivity['ActivityDate'])
df_dailyActivity['dayofweek'] = [d.weekday() for d in df_dailyActivity['ActivityDate']]
df_dailyActivity['isWeekend'] = df_dailyActivity['dayofweek'].apply(lambda x:x>=5)
# dayofweek still represented in 0-6 value, Let's change it
dayofweek_dict = {
0 : 'Monday',
1 : 'Tuesday',
2 : 'Wednesday',
3 : 'Thursday',
4 : 'Friday',
5 : 'Saturday',
6 : 'Sunday'
}
df_dailyActivity['dayofweek'] = df_dailyActivity['dayofweek'].apply(lambda x:dayofweek_dict[x])
For sleep time, I will divide it into 3 clusters: 0-7 hours as less optimal sleep, 7-9 hours as enough sleep, and 9+ hours as oversleep. I'm using those numbers based on my own research; most of the articles I read state that adults (18+) need 7-9 hours of sleep.
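The sleep clusters described above are not implemented anywhere in the notebook; a minimal sketch (the function and cluster names are mine), applied to hours of sleep:

```python
def getSleepCluster(hours_asleep):
    # thresholds from the commonly cited 7-9 hour adult recommendation
    if hours_asleep < 7:
        return 'less optimal sleep'
    if hours_asleep <= 9:
        return 'enough sleep'
    return 'oversleep'

# on a daily sleep table this could be applied as, e.g.:
# df_sleepDay['sleepCluster'] = (df_sleepDay['TotalMinutesAsleep'] / 60).apply(getSleepCluster)
```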
activeMinutesCol = ['VeryActiveMinutes', 'FairlyActiveMinutes', 'LightlyActiveMinutes']
df_dailyActivity['TotalActiveMinutes'] = df_dailyActivity[activeMinutesCol].sum(axis=1)
What to do in this section:
Our focus for this project is to market the product better. For a health app, the purpose is to track user behaviour and tell users how healthy they are. Based on my knowledge, there are two things with a huge impact on your health: how active you are, and what you consume. With this app we can track one of those factors, how active you are. To determine how active a user is from the data we have, we can look at their calories burned and their metabolic rate (METs).
# data visualization library
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go
# statistics library
import numpy as np
from scipy.stats import pearsonr
from datetime import datetime
fig = px.histogram(
df_dailyActivity,
x="Calories",
marginal="rug"
)
fig.update_layout(
title=f'Calories burn distribution within a day',
xaxis_title='Calories Burn',
yaxis_title='Count'
)
fig.show()
Let's also compare against what a calories-burned distribution should look like (the reference chart I found online is not embedded here).
The data we have might include both male and female users, and there's no way to distinguish them from calories data alone. But our distribution looks like the reference one I found on the internet, which is a good sign. Next we'll look at how calories burned and METs change throughout a day. To do this, I will take one user as an example and look at the changes within a day (a day has 1440 minutes).
sampleMETs_data = df_minuteMETsNarrow[df_minuteMETsNarrow['Id'] == 1503960366]
sampleCal_data = df_minuteCaloriesNarrow[df_minuteCaloriesNarrow['Id'] == 1503960366]
display(sampleCal_data.head(3))
print(f'number of Calories data point (sample) : {len(sampleCal_data)}')
print(f'calories max value: {sampleCal_data["Calories"].max()}')
display(sampleMETs_data.head(3))
print(f'number of METs data point (sample) : {len(sampleMETs_data)}')
print(f'METs max value : {sampleMETs_data["METs"].max()}')
The ActivityMinute feature is exactly the same in both tables, so we can easily take the first 1440 minutes (one day) and look at the difference. But since METs have higher values than calories, I will normalize them relative to calories so both hold the same range of values.
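The plot below hard-codes the rescaling as `(METs / 10) + 9`; a more general min-max rescale into the calories range (the helper name and sample values are mine) would look like:

```python
import numpy as np

def rescale_to(values, target):
    # min-max rescale `values` into the range of `target`, so two traces
    # can share one y-axis without needing a secondary axis
    v = np.asarray(values, dtype=float)
    t = np.asarray(target, dtype=float)
    v01 = (v - v.min()) / (v.max() - v.min())
    return v01 * (t.max() - t.min()) + t.min()

mets = [10, 12, 30, 80]       # synthetic minute-level METs
cals = [1.0, 1.2, 3.5, 9.0]   # synthetic minute-level calories
scaled = rescale_to(mets, cals)
```

The rescaled METs then span exactly the calories range, so the two lines overlap meaningfully in one plot.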
fig = go.Figure()
fig.add_trace(go.Scatter(x=sampleCal_data.head(1440)['ActivityMinute'], y=sampleCal_data.head(1440)['Calories'],
                         mode='lines',
                         name='Calories'))
fig.add_trace(go.Scatter(x=sampleMETs_data.head(1440)['ActivityMinute'], y=(sampleMETs_data.head(1440)['METs']/10) + 9,
                         mode='lines',
                         name='METs (normalized)'))
fig.update_layout(title='Calories and METs (normalized) changes within a day',
xaxis_title='Timeline',
yaxis_title='Value')
fig.update_xaxes(showticklabels=False)
fig.update_yaxes(showticklabels=False)
fig.show()
corr, _ = pearsonr(df_minuteMETsNarrow['METs'], df_minuteCaloriesNarrow['Calories'])
print(f'Calories and METs Correlation : {corr}')
So Calories and METs are almost perfectly positively correlated. Because of that, we can use one feature to represent the other. In this case I will use calories, since it's a better-known term than METs. Next, I will check whether certain days have an impact on calories burned.
fig = px.box(df_dailyActivity, x="dayofweek", y="Calories")
fig.update_layout(
title='Calories Burn based on the day',
xaxis_title='Day',
yaxis_title='Calories Burn'
)
fig.show()
# ------------------------
fig = px.box(df_dailyActivity, x="isWeekend", y="Calories")
fig.update_layout(
title='Calories burn during weekdays and weekend',
xaxis_title='Is it weekend?',
yaxis_title='Calories Burn'
)
fig.show()
Based on the day of the week, it's hard to spot any pattern. For weekdays vs. weekends, people appear to be more active: the lower tail of the weekend box is higher and the overall box range is smaller, meaning most of the data lies within that box. But after more analysis, I don't think we can state that "people tend to be more active on weekends". Why? The lower tail jumps from 403 on weekdays to 1032 on weekends, and based on the calories-burned distributions for men and women shown earlier, fewer than 1% of women burn that little in a day. So the change might come from a miscalculation, or from the app not tracking a full 24 hours of the user's calories burned, making the values lower than they should be.
Next, I will check on distance and steps feature!
df_dailyActivity[['TotalSteps', 'TotalDistance', 'TotalActiveMinutes', 'Calories']].describe().T
fig = px.scatter(
df_dailyActivity,
x="TotalSteps",
y="TotalDistance",
trendline="ols",
)
fig.update_layout(
title="Total Step vs Total Distance",
xaxis_title='Total Step',
yaxis_title='Total Distance (Miles)'
)
fig.data[1].line.color = 'red'
fig.show()
As steps go up, the distance goes up as well. We can conclude they are highly positively correlated. But some of the data points sit a bit apart from the rest. Let's take a closer look.
df_high_dist_steps = df_dailyActivity.loc[
    (df_dailyActivity['TotalSteps'] > 15000) | (df_dailyActivity['TotalDistance'] > 10)].copy()
df_high_dist_steps["Id"] = df_high_dist_steps["Id"].apply(str)
fig = px.scatter(df_high_dist_steps,
x="TotalSteps",
y="TotalDistance",
color="Id",
title="Total Distance vs Total Step (Zoomed in on higher value)")
fig.show()
It turns out all of those "strange" points come from a single user. This might be caused by a system error, or this user might simply be built differently, covering more distance with fewer steps than others.
Since there's a small gap between steps and distance, my instinct says it's because running covers more distance than walking for the same number of steps. So the step-vs-distance difference can represent the intensity of the steps.
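That intuition can be made explicit as a per-row distance-per-step feature (synthetic rows; the `MilesPerStep` column name is mine):

```python
import pandas as pd

# two synthetic days covering the same distance with different step counts
daily = pd.DataFrame({
    'TotalSteps': [8000, 12000],
    'TotalDistance': [6.0, 6.0],  # miles
})

# longer strides (e.g. running) cover more distance per step, which is
# exactly the step-vs-distance gap described above
daily['MilesPerStep'] = daily['TotalDistance'] / daily['TotalSteps']
```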
fig = px.box(df_dailyActivity, x="stepCluster", y="Calories",
facet_col="distanceCluster", category_orders={
"stepCluster": step_cluster_order,
"distanceCluster": distance_cluster_order
},
title='Distance and Steps vs Calories'
)
fig.show()
From the visualization above, we can see that within a given distance cluster, fewer steps come with a higher median calories burned. So this visualization tells us the pace/intensity of the walk matters more than the distance. Also, the differences between clusters are quite large (about 3 - 5 km, i.e. 2 - 3 miles, between cluster boundaries).
Next, let's check how steps, distance, and active minutes affect calories burned. I will skip distance, because we already know steps and distance are highly correlated, so the graph would look very similar.
fig = px.scatter(df_dailyActivity,
x="TotalSteps",
y="Calories",
trendline="ols",
title="Calories vs Total Steps",
hover_data=["Id"]
)
fig.data[1].line.color = 'red'
fig.show()
fig = px.scatter(df_dailyActivity,
x="TotalActiveMinutes",
y="Calories",
trendline="ols",
title="Calories vs Total Active Minutes",
hover_data=["Id"]
)
fig.data[1].line.color = 'red'
fig.show()
It turns out total steps have more effect than total active minutes. Since total active minutes is the sum of 3 levels of active minutes, let's look at how much impact each level has on calories.
df_activeMinutes_summary = df_dailyActivity[
['CaloriesCluster',
'VeryActiveMinutes',
'FairlyActiveMinutes',
'LightlyActiveMinutes',
'Calories',
]
].groupby("CaloriesCluster").mean().reset_index()
# the clusters aren't sorted correctly: the '< 1000' calorie cluster is
# placed at the end where it should come first, which would affect the
# plotting order later, so we sort it manually
df_activeMinutes_summary["sequence"] = range(1,len(df_activeMinutes_summary)+1)
df_activeMinutes_summary.loc[df_activeMinutes_summary.index==7, 'sequence'] = 0
df_activeMinutes_summary = df_activeMinutes_summary.sort_values("sequence").drop('sequence', axis=1).reset_index(drop=True)
df_activeDistance_summary = df_dailyActivity[
['CaloriesCluster',
'VeryActiveDistance',
'ModeratelyActiveDistance',
'LightActiveDistance',
'Calories',
]
].groupby("CaloriesCluster").mean().reset_index()
# same manual sort as above: move the '< 1000' cluster to the front
df_activeDistance_summary["sequence"] = range(1,len(df_activeDistance_summary)+1)
df_activeDistance_summary.loc[df_activeDistance_summary.index==7, 'sequence'] = 0
df_activeDistance_summary = df_activeDistance_summary.sort_values("sequence").drop('sequence', axis=1).reset_index(drop=True)
def px_lineActivePlot(data, x_axis, col_names, name_seq):
    # go.Line is deprecated; go.Scatter with mode='lines' is the same trace
    fig = go.Figure(data=[
        go.Scatter(name=name,
                   mode='lines',
                   x=data[x_axis],
                   y=data[col_name],
                   text=round(data[col_name], 1)
                   ) for col_name, name in zip(col_names, name_seq)
    ])
    fig.update_layout(
        title=f"{x_axis} vs {name_seq[0].split(' ', 1)[1]}",
        xaxis_title='Calories Cluster',
        yaxis_title='Average value')
    fig.show()
col_names_minutes = ['VeryActiveMinutes', 'FairlyActiveMinutes', 'LightlyActiveMinutes']
name_seq_minutes = ['Very Active Minutes', 'Fairly Active Minutes', 'Lightly Active Minutes']
px_lineActivePlot(df_activeMinutes_summary, 'CaloriesCluster', col_names_minutes, name_seq_minutes)
col_names_dist = ['VeryActiveDistance', 'ModeratelyActiveDistance', 'LightActiveDistance']
name_seq_dist = ['Very Active Distance', 'Moderately Active Distance', 'Light Active Distance']
px_lineActivePlot(df_activeDistance_summary, 'CaloriesCluster', col_names_dist, name_seq_dist)
From the visualizations above, we can conclude that very active minutes and very active distance play a huge role in calories burned. This strongly supports our previous finding that fewer steps over a high distance lead to more calories burned.
Let's check when users are most active.
hourly_mean = df_hourlyIntensities.groupby('Time_').mean(numeric_only=True)
fig = px.line(
    hourly_mean,
    x=hourly_mean.index,
    y="AverageIntensity",
    title='Average intensity (Hourly)')
fig.show()
minute_mean = df_minuteIntensitiesNarrow.groupby('Time_').mean(numeric_only=True)
fig = px.line(
    minute_mean,
    x=minute_mean.index,
    y="Intensity",
    title='Average intensity (Minutes)')
fig.show()
So users are mostly active from 17:00 to 19:00. From this, I can recommend when to run the marketing campaign. Next, let's look at the sleep feature and see whether sleeping affects how active you are during the day.
df_activity_join_sleep = df_sleepDay.merge(df_dailyActivity, on=['Id', 'Date_'])
display(df_activity_join_sleep.head(2))
print(f'number of unique Id : {df_activity_join_sleep["Id"].nunique()}')
Let's check how total minutes asleep relates to total time in bed.
fig = px.scatter(df_activity_join_sleep,
x="TotalMinutesAsleep",
y="TotalTimeInBed",
trendline="ols",
title="Total Minutes Asleep vs Total Time In Bed",
hover_data=["Id"]
)
fig.data[1].line.color = 'red'
fig.show()
When you're asleep, that time also counts as time in bed, so total time in bed can never be lower than total minutes asleep. Next, let's check how sleep time affects how active you are during the day.
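That invariant (time in bed is never less than minutes asleep) can be checked directly on the merged frame; a sketch on toy rows:

```python
import pandas as pd

# synthetic rows with the same columns as the merged sleep/activity frame
sleep = pd.DataFrame({
    'TotalMinutesAsleep': [400, 310],
    'TotalTimeInBed': [430, 340],
})

# being asleep counts as being in bed, so this count should always be 0
violations = int((sleep['TotalTimeInBed'] < sleep['TotalMinutesAsleep']).sum())
```

Running the same check on the real frame would be a quick sanity test before trusting the scatter plot above.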
# converting minutes to hours also makes it easier to understand
df_activity_join_sleep['TotalHoursAsleepRounded'] = round(df_activity_join_sleep['TotalMinutesAsleep']/60)
colors = ['#3a5eff', '#ff3a3a', '#1ccc02'] # for better continuity from the visualization before
temp_df = df_activity_join_sleep.sort_values(by=['TotalHoursAsleepRounded'])
fig = go.Figure(data = [
go.Box(
x=temp_df['TotalHoursAsleepRounded'],
y=temp_df[col_name],
name=name,
marker_color=color
) for col_name, name, color in zip(col_names_minutes, name_seq_minutes, colors)
])
fig.update_layout(
title='Active Minutes vs Total Hours Asleep',
yaxis_title='Active Minutes',
xaxis_title='Total Hours Asleep',
boxmode='group' # group together boxes of the different traces for each value of x
)
fig.show()
Lightly active minutes really distort the other series that we already know are more important, especially very active minutes, since those affect calories the most.
colors = ['#3a5eff', '#ff3a3a', '#1ccc02']
temp_df = df_activity_join_sleep.sort_values(by=['TotalHoursAsleepRounded'])
fig = go.Figure(data = [
go.Box(
x=temp_df['TotalHoursAsleepRounded'].apply(str),
y=temp_df[col_name],
name=name,
marker_color=color
) for col_name, name, color in zip(col_names_minutes[:2], name_seq_minutes[:2], colors[:2])
])
fig.update_layout(
title='Active Minutes (Lightly Active Minutes Excluded) vs Total Hours Asleep',
yaxis_title='Active Minutes',
xaxis_title='Total Hours Asleep',
boxmode='group' # group together boxes of the different traces for each value of x
)
fig.show()
Since sedentary minutes were excluded from the analysis above, I'm going to check whether sedentary minutes have any effect on sleep time.
fig = px.box(df_activity_join_sleep, x='TotalHoursAsleepRounded', y='SedentaryMinutes')
fig.update_layout(
title='Sedentary Minutes vs Total Hours Asleep',
yaxis_title='Sedentary Minutes',
xaxis_title='Total Hours Asleep',
boxmode='group' # group together boxes of the different traces for each value of x
)
fig.show()
Interestingly, users with 1 hour of sleep have more very active minutes. This might be because we only have a small amount of data at extreme values (very low or very high) of sleep hours. But, as expected, there's an increase in very active minutes for users with 5-8 total hours of sleep. From this plot we learn that if you sleep more than 8 hours or less than 5, you're less likely to have many very active minutes. As for sedentary activity, it turns out users tend to sleep badly when they do a lot of sedentary activity (little to no movement). This is an interesting finding and now I'm feeling very excited! Let's dive deeper into the sleep feature. I have a hypothesis that the sleeping schedule also plays a role in how active you are. Let's test it! But first, let's try to identify when a user is considered "sleeping" from the minuteSleep data.
From the time distribution back in the Process step, we already have an intuition that df_minuteSleep only has records while the user is considered asleep, because most of the data lies within typical sleep hours. I will try to prove it in this section with logId. But first, let's find someone with more than one total sleep record within a day and take them as a sample.
df_activity_join_sleep.loc[df_activity_join_sleep['TotalSleepRecords'] > 1].head(1)
minuteSleep_data_sample = df_minuteSleep.loc[(df_minuteSleep['Id'] == 1503960366) &
(df_minuteSleep['Date_'] == datetime(2016,4,13).date())]
print(f'number of unique logId : {minuteSleep_data_sample["logId"].nunique()}')
So a day with 2 total sleep records has 2 unique logIds. That means we can group the data by Id and logId, and take the max time as the wake-up time and the min time as the sleep start time.
df_sleepTime = df_minuteSleep.groupby(['Id', 'logId']).agg({
    'Date_': ['max'],
    'Time_': ['min', 'max'],
}).reset_index()
# renaming columns
new_col_names = list(df_sleepTime.columns.droplevel(1))
new_col_names[3:5] = ('sleepStart', 'sleepEnd')
df_sleepTime.columns = new_col_names
df_sleepTime.head(2)
filter_column_list = ['Id', 'logId', 'Date_', 'sleepStart', 'sleepEnd', 'Calories']
df_sleepTime_with_calories = df_sleepTime.merge(
    df_dailyActivity,
    on=['Id', 'Date_']
)[filter_column_list]
df_sleepTime_with_calories.head(2)
def getTimeHour(x):
    return x.hour
df_sleepTime_with_calories['sleepStartRounded'] = df_sleepTime_with_calories['sleepStart'].apply(getTimeHour)
df_sleepTime_with_calories['sleepEndRounded'] = df_sleepTime_with_calories['sleepEnd'].apply(getTimeHour)
fig = px.box(df_sleepTime_with_calories, x="sleepStartRounded", y="Calories")
fig.update_layout(
title='Calories vs Sleep start time',
yaxis_title='Calories',
xaxis_title='Sleep start time',
)
fig.show()
fig = px.box(df_sleepTime_with_calories, x="sleepEndRounded", y="Calories")
fig.update_layout(
title='Calories vs Wake up time',
yaxis_title='Calories',
xaxis_title='Wake up time',
)
fig.show()
This is strange. Why? Our data shows that 75% of the records have at least 6 hours of sleep, so having a lot of data with sleep starting at hour 0 and a lot of data with sleep ending at hour 23 doesn't make sense.
I WAS WRONG! I later noticed this is not a data error: I made the mistake of not considering that a night's sleep usually spans two different days, so I should have grouped by the sleep logId. But I won't delete this mistake, because from it I learned to always double-check interesting findings. Imagine a real-life scenario where I didn't notice this mistake and had to share the finding with the product team: the ending would not be good.
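The midnight issue can be demonstrated on two synthetic sleep minutes one minute apart: grouping by calendar date splits a single sleep into two records, while grouping by logId keeps it whole:

```python
import pandas as pd

# one sleep session crossing midnight, as two minute-level rows
mins = pd.DataFrame({
    'logId': [1, 1],
    'date': pd.to_datetime(['2016-04-12 23:59:00', '2016-04-13 00:00:00']),
})
mins['Date_'] = mins['date'].dt.date

# grouping by calendar date: the session is split across two groups
by_date = mins.groupby('Date_')['date'].agg(['min', 'max'])

# grouping by logId: the session stays in one group with correct start/end
by_log = mins.groupby('logId')['date'].agg(['min', 'max'])
```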
df_sleepTime = df_minuteSleep.groupby(['Id', 'logId']).agg({
    'date': ['min', 'max'],
}).reset_index()
# renaming columns
new_col_names = list(df_sleepTime.columns.droplevel(1))
new_col_names[2:4] = ('sleepStart', 'sleepEnd')
df_sleepTime.columns = new_col_names
df_sleepTime['Date'] = df_sleepTime['sleepEnd'].apply(lambda dt: dt.date())
display(df_sleepTime.head(2))
display(df_dailyActivity[['Id', 'ActivityDate', 'Calories']].head(2))
filtered_column_list = ['Id', 'logId', 'Date_', 'sleepStart', 'sleepEnd', 'Calories']
df_sleepTime_with_calories = df_sleepTime.merge(
df_dailyActivity,
left_on=['Id', 'Date'],
right_on=['Id', 'Date_']
)
df_sleepTime_with_calories[filtered_column_list]
def getTimeHour(x):
    return x.hour
df_sleepTime_with_calories['sleepStartRounded'] = df_sleepTime_with_calories['sleepStart'].apply(getTimeHour)
df_sleepTime_with_calories['sleepEndRounded'] = df_sleepTime_with_calories['sleepEnd'].apply(getTimeHour)
fig = px.box(df_sleepTime_with_calories, x="sleepStartRounded", y="Calories")
fig.update_layout(
title='Calories vs Sleep start time',
yaxis_title='Calories',
xaxis_title='Sleep start time',
)
fig.show()
fig = px.box(df_sleepTime_with_calories, x="sleepEndRounded", y="Calories")
fig.update_layout(
title='Calories vs Wake up time',
yaxis_title='Calories',
xaxis_title='Wake up time',
)
fig.show()
From the data above, we learn that users who wake up earlier in the morning (especially at 4 to 5) tend to burn more calories.
fig = px.histogram(df_sleepTime_with_calories, x="sleepStartRounded", nbins=20)
fig.update_layout(
title='Sleep start data distribution',
xaxis_title='Sleep start time',
)
fig.show()
Next, let's take a look at the last table we'll analyze: heart rate. Less than 50% of Ids have this data, but since it's an important feature, I will do a bit of analysis and look at samples from the users with the highest and lowest average heart rate.
df_heartrate_seconds['Id'] = df_heartrate_seconds['Id'].apply(str)
df_heartrate_seconds.head(2)
df_heartrate_daily_mean = df_heartrate_seconds.groupby(['Id', 'Date_']).mean(numeric_only=True)
display(df_heartrate_daily_mean)
fig = px.box(
    x=df_heartrate_daily_mean.index.get_level_values(0),
    y=df_heartrate_daily_mean['Value']
)
fig.update_layout(
title='Heart rate value range for each user',
yaxis_title='Heart rate value Range',
xaxis_title='User ID',
)
fig.show()
heartrate_seconds_sample_high = df_heartrate_seconds.loc[(df_heartrate_seconds['Id'] == '6775888955')]
heartrate_seconds_sample_low = df_heartrate_seconds.loc[(df_heartrate_seconds['Id'] == '4388161847')]
heartrate_seconds_sample_high.head(2)
fig = px.line(
heartrate_seconds_sample_high,
x='Time',
y="Value",
title='Heart rate data from a user with the highest average daily heart rate'
)
fig.show()
fig = px.line(
heartrate_seconds_sample_low,
x='Time',
y="Value",
title='Heart rate data from a user with the lowest average daily heart rate'
)
fig.show()
I know the visualization above is not optimal and hard to read, but for the user with the highest average heart rate, the data is incomplete with many gaps. So not only do some Ids lack heart rate data entirely, the data we do have also contains a lot of missing stretches.
What to do in this section:
The best time to run the marketing campaign
Who to target
Educational content topics
The sleepDay table stores its date in a column called "SleepDay" with a datetime format like "4/12/2016 12:00:00 AM", while the other daily tables all use the same column name, ActivityDay, with a date format like "4/12/2016". For table names, only one table is named differently: heartrate_seconds; all the other tables use a camel case naming convention.

The heartrate_seconds and minuteSleep tables both have a generic "value" column. For heartrate_seconds, I know the value represents the user's current heart rate at that time. But for minuteSleep, the value ranges from 1-3 and I had to do further research on what it means; I found that 1 means the user is asleep, but what about 2 and 3? My suggestion is to rename the column from value to heartrate in the heartrate_seconds table, and to whatever the minuteSleep value actually represents in that table.

For the marketing and product teams, I'll schedule a separate meeting with each, where I share my findings and the actions each team needs to take. For the data team, I would also like to set up a meeting about both of the points above, plan how and when to act on them, and be part of the project to make sure each goal is met.